16 research outputs found

    Evaluation of automatic hypernym extraction from technical corpora in English and Dutch

    In this research, we evaluate different approaches for the automatic extraction of hypernym relations from English and Dutch technical text. The detected hypernym relations should enable us to semantically structure automatically obtained term lists from domain- and user-specific data. We investigated three different hypernymy extraction approaches for Dutch and English: a lexico-syntactic pattern-based approach, a distributional model and a morpho-syntactic method. To test the performance of the different approaches on domain-specific data, we collected and manually annotated English and Dutch data from two technical domains, viz. the dredging and financial domains. The experimental results show that the morpho-syntactic approach in particular obtains good results for automatic hypernym extraction from technical and domain-specific texts.
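
    To make the lexico-syntactic pattern-based approach concrete, here is a minimal sketch of a single "X such as Y" pattern. The regular expression, the function name `extract_hypernym_pairs` and the example sentence are illustrative assumptions, not taken from the paper; a realistic system would presumably match over POS-tagged or chunked noun phrases rather than single raw words.

```python
import re

# Hypothetical, simplified pattern (an assumption, not the paper's pattern set):
#   <hypernym> such as <hyponym>(, <hyponym>)* ((,) and|or <hyponym>)?
# Each term is a single word here; real systems match whole noun phrases.
SUCH_AS = re.compile(
    r"(?P<hyper>\w+) such as "
    r"(?P<hypos>\w+(?:, (?!and\b|or\b)\w+)*(?:,? (?:and|or) \w+)?)"
)


def extract_hypernym_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hyper")
        # Split the enumeration on ", ", " and ", " or " (with optional Oxford comma).
        hyponyms = re.split(r",? (?:and|or) |, ", match.group("hypos"))
        pairs.extend((hypo, hypernym) for hypo in hyponyms if hypo)
    return pairs


if __name__ == "__main__":
    text = "The fleet includes vessels such as dredgers, barges and tugs."
    for hyponym, hypernym in extract_hypernym_pairs(text):
        print(f"{hyponym} IS-A {hypernym}")
```

    Patterns of this kind tend to be precise but match only relations that are stated explicitly in a sentence, which is one plausible reason to combine them with other methods.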

    Noise or music? Investigating the usefulness of normalisation for robust sentiment analysis on social media data

    In the past decade, sentiment analysis research has thrived, especially on social media. While this data genre is suitable to extract opinions and sentiment, it is known to be noisy. Complex normalisation methods have been developed to transform noisy text into its standard form, but their effect on tasks like sentiment analysis remains underinvestigated. Sentiment analysis approaches mostly include spell checking or rule-based normalisation as preprocessing and rarely investigate its impact on the task performance. We present an optimised sentiment classifier and investigate to what extent its performance can be enhanced by integrating SMT-based normalisation as preprocessing. Experiments on a test set comprising a variety of user-generated content genres revealed that normalisation improves sentiment classification performance on tweets and blog posts, showing the model’s ability to generalise to other data genres.

    LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit

    This paper presents the LeTs Preprocess Toolkit, a suite of robust high-performance preprocessing modules including Part-of-Speech Taggers, Lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train each component. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies covering various domains (among others automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the performance of our preprocessing components on this corpus and compare it to the performance of other existing tools.

    Terminologie: op het snijvlak van ambacht en technologie

    The article gives an overview of the activities and projects in the field of terminology within the VTC department and its predecessors. It covers terminographic projects as well as language technology applications and term extraction.

    The good, the bad and the implicit: a comprehensive approach to annotating explicit and implicit sentiment

    We present a fine-grained scheme for the annotation of polar sentiment in text that accounts for explicit sentiment (so-called private states) as well as implicit expressions of sentiment (polar facts). Polar expressions are annotated below sentence level and classified according to their subjectivity status. Additionally, they are linked to one or more targets with a specific polar orientation and intensity. Other components of the annotation scheme include source attribution and the identification and classification of expressions that modify polarity. In previous research, little attention has been given to implicit sentiment, which represents a substantial share of the polar expressions encountered in our data. An English and a Dutch corpus of financial newswire, consisting of over 45,000 words each, were annotated using our scheme. A subset of this corpus was used to conduct an inter-annotator agreement study, which demonstrated that the proposed scheme can be used to reliably annotate explicit and implicit sentiment in real-world textual data, making the created corpora a useful resource for sentiment analysis.
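
    The abstract does not say which agreement statistic the study used. As one common choice for two annotators assigning categorical labels, the sketch below computes Cohen's kappa; the function name `cohen_kappa` and the toy polarity labels are illustrative assumptions and not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label: agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy example (assumed labels, not from the corpus).
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "neg"]
    print(cohen_kappa(annotator_1, annotator_2))
```

    Kappa corrects raw agreement for chance, which matters when one label (for example, neutral) dominates the data.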

    The good, the bad and the implicit: annotating polarity

    Most of the existing sentiment annotation schemes focus on the identification of subjective statements, which explicitly express an evaluation of a certain target. Subjective statements are particularly common in user-generated content such as user reviews or blogs. However, we find that these annotation schemes are insufficient for capturing all occurrences of sentiment, which is often expressed in an implicit way. This is true especially in factual text types such as newswire, where explicit sentiment expressions are rare. We therefore propose a new annotation scheme for the fine-grained analysis of explicit as well as implicit expressions of positive and negative sentiment, also called polar expressions. This scheme was applied to a corpus of economic news articles by 8 annotators. In this presentation, we discuss the annotation scheme and the results of the annotation effort, including inter-annotator agreement.

    HypoTerm detection of hypernym relations between domain-specific terms in Dutch and English

    HypoTerm is a data-driven semantic relation finder that starts from a list of automatically extracted domain- and user-specific terms from technical corpora, and generates a list of relations between these terms. This research study focused on the detection of hypernym relations between relevant terms and named entities. In order to detect all relevant hypernym relations in technical texts, we combined a lexico-syntactic pattern-based approach and a morpho-syntactic analyzer. To evaluate our relation finder, we constructed and manually annotated gold standard data for the dredging and financial domains in Dutch and English. The experimental results show that the HypoTerm system achieves high precision and recall figures for technical texts when starting from valid domain-specific terms and named entities. Thanks to this data-driven approach, it is possible to take an important step from terminology to concept extraction without using any external lexico-semantic resources.
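
    The abstract does not describe how the morpho-syntactic analyzer works. One common heuristic of this kind is the head-modifier principle: in English and Dutch, the head of a compound or multiword term is usually its rightmost part, so a term whose right-hand side is itself a known term is a candidate hyponym of that shorter term. The sketch below illustrates only this heuristic; the function name `head_based_hypernyms`, the minimum head length and the example terms are assumptions, not details of HypoTerm.

```python
def head_based_hypernyms(terms, min_head_length=4):
    """Return sorted (hyponym, hypernym) pairs using a right-hand-head heuristic.

    Assumption: a term is a hyponym of any shorter term in the same list
    that forms its right-hand head. No external lexical resource is used.
    """
    term_set = {term.lower() for term in terms}
    pairs = set()
    for term in term_set:
        tokens = term.split()
        # Multiword terms: every proper token suffix that is itself a term.
        for start in range(1, len(tokens)):
            suffix = " ".join(tokens[start:])
            if suffix in term_set:
                pairs.add((term, suffix))
        # Single-token compounds: another single-token term as a string suffix.
        # min_head_length guards against accidental matches on very short strings.
        if len(tokens) == 1:
            for other in term_set:
                if (
                    other != term
                    and " " not in other
                    and len(other) >= min_head_length
                    and term.endswith(other)
                ):
                    pairs.add((term, other))
    return sorted(pairs)


if __name__ == "__main__":
    # Illustrative term list (English and Dutch), not taken from the gold standard.
    term_list = ["cutter suction dredger", "suction dredger", "dredger",
                 "baggerschip", "schip"]
    for hyponym, hypernym in head_based_hypernyms(term_list):
        print(f"{hyponym} IS-A {hypernym}")
```

    Because the heuristic only relates terms already in the list, its output depends on the quality of the term extraction step, which is consistent with the abstract's remark about starting from valid domain-specific terms.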